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Abstract 

We propose a generic design for Chinese remainder algorithms. A Chi- 
**! ' nese remainder computation consists in reconstructing an integer value 

from its residues modulo non coprime integers. We also propose an ef- 
ficient linear data structure, a radix ladder, for the intermediate storage 
and computations. Our design is structured into three main modules: a 
black box residue computation in charge of computing each residue; a 
y^ • Chinese remaindering controller in charge of launching the computation 

and of the termination decision; an integer builder in charge of the re- 
construction computation. We then show that this design enables many 
different forms of Chinese remaindering (e.g. deterministic, early termi- 
^ I nated, distributed, etc.), easy comparisons between these forms and e.g. 

^—2 . user-transparent parallelism at different parallel grains. 

oo 

^. '. 1 Introduction 

in 

^^ , Modular methods are largely used in computer algebra to reduce the cost of co- 

efficient growth of the integer, rational or polynomial coefficients. Then Chinese 
remaindering (or interpolation) can be used to recover the large coefficient from 
their modular evaluations by reconstructing an integer value from its residues 

K> , modulo non coprime integers. 



LinBoxu[S] is an exact linear algebra library providing some of the most 
5^ , efficient methods for linear systems over arbitrary precision integers. For in- 

stance, to compute the determinant of a large dense matrix over the integers 
one can use linear algebra over word size finite fields |10] and then use a combi- 
nation of system solving and Chinese remaindering to lift the result [T3| . The 
Frobenius normal form of a matrix is used to test two matrices for similarity. 
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Although the Frobenius normal form contains more in formation on the matrix 
than the characteristic polynomial, most efficient algorithms to compute it are 
based on computations of characteristic polynomial (see for example [H]). Now 
the Smith normal form of an integer matrix is useful e.g. in the computation 
of homology groups and its computation can be done via the integer minimal 
polynomial [H]. In both cases, the polynomials are computed first modulo sev- 
eral prime numbers and then only reconstructed via Chinese remaindering using 
precise bounds on the integer coefficients of the integer characteristic or minimal 
polynomials [T51I5]. 

An alternative to the deterministic remaindering is to terminate the recon- 
struction early when the actual integer result is smaller than the estimated 
bound [m [121 HD] . There after the reconstruction stabilizes for some modular 
iterations, the computation is stopped and gives the correct answer with high 
probability. 

In this paper we propose first in section [2] a linear space data structure en- 
abling fast computation of Chinese reconstruction, alternative to subproduct 
trees. Then we propose in section [3] to structure the design of a generic pattern 
of Chinese remaindering into three main modules: a black box residue compu- 
tation in charge of computing each residue; a Chinese remaindering controller 
in charge of launching the computation and of the termination decision; an in- 
teger builder in charge of the reconstruction computation. We show in section 2] 
that this design enables many different forms of Chinese remaindering (e.g. de- 
terministic, early terminated, distributed, etc.) and easy comparisons between 
these forms. We show then in section [5] that this structure provides also an 
easy and efficient way to provide user-transparent parallelism at different par- 
allel grains. Any parallel paradigm can be implemented provided that it fulfills 
the defined controller interface. We here chose to use Kaap1j[TB| to show the 
efficiency of our approach on distributed/shared architectures. 

2 Radix ladder: linear structure for fast Chinese 
remaindering 

2.1 Generic reconstruction 

We are given a black box function which computes the evaluation of an integer 
R modulo any number m (often a prime number). 

To reconstruct R, we must have enough evaluations rj = R mod rrij modulo 
coprimes nij. To perform this reconstruction, we need two by two liftings with 
U = R mod M and V = R mod N as follows: 

Rmn = U+{V -U)x{M-^ mod N)xM. (1) 

We will need this combination most frequently in two different settings: when 
M and N have the same size, and when TV is of size 1. The first generic aspect 
of our development is that for both cases, the same implementation can be fast. 
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We first need a complexity model. We do not give much details on fast 
integer arithmetic in this paper, instead our point is to show the genericity of our 
approach and that it facilitates experiments in order to obtain goods practical 
efficiency with any underlying arithmetic. Therefore we propose to use a very 
simplified model of complexity where division/inverse/modulo/gcd are slower 
than multiplication. We denote by d^P the complexity of the pgcd of integers 
of size I with 1 < a < 2, and ranging from 0{l'^) for classical multiplication to 
0(Z^+^) for FFT-like algorithms, that the complexity of integer multiplication 
of size / can be bounded by nial" (e.g. m2 =2). We refer to e.g. the GMP 
manuaO or [HJ [T3| for more accurate estimates. 

With this in mind we compute formula ([1|) with one multiplication modulo 
as follows: 



Algorithm 1 Reconstruct 



Input: U = R mod M and V = R mod N. 
Output: Rain = R mod M x N. 



Un^V -U mod N; 

Mn ^ A/"^ mod N; 

Un = Un X Mn mod N; 

Rmn = U + Un X M; 

if Rmn > M x N then Rmn 



= Rmn - M X N end if 



Now, if the formula ([!]) is computed via algorithm [T] and the operation counts 
uses column "Mul." for multiplication and "Div./Gcd." for division/inverse/- 
modulo/gcd, then we have the complexities given in column " CRT" of table [TJ 



Size of operands 


Mul. 


Div. 
Gcd. 


CRT 


I X 1 
Ixl 


I 
mj°' 


31 


9^ + 0(1) 



Table 1: Integer arithmetic complexity model 



Proof. • if N is of size 1, then: 

1. Un'- requires 1 division modulo N. 

2. M]y: computes M mod N (1 division) and then the gcd of size 1 is 
0(1). 

3. Un'- requires 1 multiplication of size 1 which is 0(1). 

4. Rmn- requires 1 multiplication / x 1. 

5. M X N: requires 1 multiplication 1x1, then 1 potential addition. 
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Overall, this is2x3Z + 2x; + ; + 0(1) = 91 + 0(1) operations. 
• if If M and N are of size /, then: 

1. Un'- requires 1 addition mod N, complexity 0(1). 

2. Mn: requires 1 gcd. 

3. Un'- requires 1 modular multiplication. 

4. Rain'- requires 1 multiplication. 

5. M X N: requires 1 multiplication and 1 potential addition. 

Overall, this is 0{l) + {2da + 2m„) I". 
2.2 Radix ladder 



D 



Fast algorithms for Chinese remaindering rely on reconstructing pairs of residues 
of the same size. A usual way of implementing this is via a binary tree struc- 
ture (see e.g. figure [1] left). But Chinese remaindering is usually an iterative 
procedure and residues are added one after the other. Therefore it is possible to 
start combining them two by two before the end of the iterations. Furthermore, 
when a combination has been made it contains all the information of its leaves. 
Thus it is sufficient to store only the partially recombined parts and cut its 
descending branches. We propose to use a radix ladder for that task. A radix 
ladder is a ladder composed of successive shelves. A shelf is either empty or 
contains a modulus and an associated residue, denoted respectively Mi and Ui 
at level i. Moreover, at level /, are stored only residues or moduli of size 2'. 
New pairs of residues and moduli can be inserted anywhere in the ladder. If 
the shelf corresponding to its size is empty, then the pair is just stored there, 
otherwise it is combined with occupant of the shelf, the latter is dismissed and 
the new combination tries to go one level up as shown on algorithm [2] 

Algorithm 2 RADixLADDER.insert(t/, A/) 
Input: U = R mod M and a Radix ladder 
Output: Insertion of U and M in the ladder., 

1: for i = size{M) while Shelf[i] is not empty do 

2: y, M :=Reconstruct(C/ mod M, C/j mod Mj); 

3: Pop Shelf [i]; 

4: Increment z; 

5: end for 

6: Push C/, M in Shelf W; 

Then if the new level is empty the combination is stored there, otherwise it 
is combined and goes up ... An example of this procedure is given on figure [T] 
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Figure 1: A residue going up the radix ladder 



Then to recover the whole reconstructed number it is sufficient to iterate 
through the ladder from the ground level and make all the encountered par- 
tial results go to up one level after the other to the top of the ladder. As 
we will see in section I3.3| LinBox-1.1.7 contains such a data structure, in 
|linbox/algorithms/cra-full-multip.^ 

An advantage of this structure is that it enables insertion of any size pair with 
fast arithmetic complexity. Moreover, merge of two ladders is straightforward 
and we will make an extensive use of that fact in a parallel setting in section [5l 



Algorithm 3 RadixLadder. merge 



Input: Two radix ladders RLi and RL2- 
Output: In place merge of RLi and RL2. 



for i = to size(i?L2) do 
i?Li.insert(i?L2-Shelf[i]); 
end for 
Return RLi 



3 A Chinese remaindering design pattern 

The generic design we propose here comes from the observation that there are in 
general two ways of computing a reconstruction; a deterministic way computing 
all the residues until the product of moduli reaches a bound on the size of the 
result ; or a probabilistic way using early termination. We thus propose an 
abstraction of the reconstruction process in three layers: a black box function 
produces residues modulo small moduli, an integer builder produces reconstruc- 
tions using algorithm [21 and a Chinese remaindering controller commands them 
both. 

Here our point is that the controller is completely generic where the builder 
may use e.g. the radix ladder data structure proposed in section [2] and has to 
implement the termination strategy. 



3.1 Black box residue computation 

In general this consists in mapping the problem from Z to 'L/m'^ and com- 
puting the result modulo m. Such black boxes are defined e.g. for the de- 
terminant, valence, minpoly, charpoly, linear system solve as function objects 
|IntegerMod.ulaf*| (where[*]is one of the latter functions) in the |linbox/solutions| 
directory of LinBox-1.1.7. 

3.2 Chinese remaindering controller 

The pattern we propose here is generic with respect to the termination strategy 
and the integer reconstruction scheme. The controller must be able to initialize 
the data structure via the builder ; generate some coprime moduli ; apply the 
black box function ; update the data structure ; test for termination and output 
the reconstructed clement. The generations of moduli and the black box are 
parameters and the other functionalities are provided by any builder. Then the 
control is a simple loop. Algorithm 0] shows this loop which contains also the 
whole interface of the Builder. 

Algorithm 4 CRA-Control 



_BiiiWe7-.initialize(); 

while BuiWer.notTerminatedO do 

p :— BuiWer.nextCoPrimeO; 

V := _B/acfc_Boa;. apply (p); 

Builder. wpdsLte{v,p)\ 
end while 
Return i?Mi/(ier.reconstruct(); 



LinBox-1.1.7 gives an implementation of such a controller, parametrized by 
a builder and a black box function as the class lChineseRemainderl in |linbox/algorithms/cra-domain.h[ 

The interface of a controller is to be a function class. It contains a constructor 
with a builder as argument and the functional operator taking as argument a 
BlackBox, computing e.g. a determinant modulo to, and a moduli generator and 
returning an integer reconstructed from the modular computations. Algorithm 
[5]shows the specifications of the LinBox-1.1.7 controller. Then any higher-level 
algorithm just choose its builder and its controller and pass them the modular 
BlackBox iteration it wants to lift over the integers. 

3.3 Integer builders 

The role of the builder is to implement the interface defined by algorithm [H 

There are already three of these implementations in LinBox-1.1.7: an early 
terminated for a single residue, an early terminated for a vector of residues 
and a deterministic for a vector of residues (resp. the files era-early- single . h^ 
|cra-early- 1 |mult ip . h| and |cra-full-multip.h| in the [linbox/algorithms | direc- 
tory). Up to now the radix ladder is not a separated class as only this data 



Algorithm 5 C++ ChineseRemainder class 



1 template<class Builder> struct ChineseRemainder { 

2 ChineseRemainder ( const Builder^ b) builder_(b) {} 

3 template<class Function> integer^ operator () ( 

4 I nteger & res , 

5 const Function &i Blacl<Box) { 

6 // CRA-Control . . . 

^ } 

8 const Builder^ getBuilder() { return builder.; } 

9 protected: Builder builder.; 

10 }; 



structure is currently used and as it is simple enough to inherit from one of the 
latter and modify the behavior of the methods. 

Actually [Ear lyMult ipCRA| inherits from both |EarlySingleCRA| and |FullMultipCRA| 
as it uses the radix ladder of FullMultipCRA for its reconstruction and the early 
termination of |EarlySingleCR A to test a linear combination of the residues to be 
reconstructed as shown on figure [2] The | FullMultipCRA| has been implemented 
so that when a vector/matrix is reconstructed the moduli and some computa- 
tions arc shared among the ladders. Wc give more implementation details on 
the early termination strategies in sections S] and [5] 

3.4 Mappers and binders 

To further enhance genericy, the mapping of between integer and field operations 
can also be automatized. If the data structure storing the matrix disposes of 
binder adaptors generic mappers can be designed. This is the case for the sparse 
and dense matrices of linbox and a generic converter, using the Givaro/Lin- 
Box fields init and convert converters, can be found in linbox/f ield/hom.h| 
|linbox/algorithiii/matrix-hom.h[ 

Then, to map any function class to the field representation one can use the 
following generic mapper: 

An example of the design usage, here computing a determinant via Chinese 
remaindering, is then simply: 

4 Termination strategies 

We sketch here several termination strategies and show that our design enables 
to modify this strategy and only that while the rest of the implementation is 
unchanged. 
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Figure 2: Early termination of a vector of residues via a linear combination 



4.1 Deterministic strategy 

There Full * C RA.upda.te{v , p) just adds the residues to the ladder ; where 
Full * Ci?^. notTerminated() tests if the product of primes so far exceeds the 
precomputed deterministic bound. 



4.2 Earliest termination 

In a sequential mode, depending on the actual speed of the different routines 
of table [T] on a specific architecture or if the cost of Black Box. apply is largely 
dominant, one can choose to test for termination after each call to the black 
box. A way to implement the probabilistic test of |12[ Lemma 3.1] and to reuse 
every black box apply is to use random primes as the moduli generator. Indeed 
then the probabilistic check can be made with the incoming black box residue 
computed modulo a random prime. The reconstruction algorithm of section [3] 
is then only slightly modified as shown in algorithm [5] and the termination test 
becomes simply algorithm [9] 

In the latter algorithm, EarlyT erminationThreshold is the number of suc- 
cessive stabilizations required to get a probabilistic estimate of failures. It will 
be denoted FT for the rest of the paper. This is the strategy implemented in 
LinBOX-1.1.7 in, linbox/algorithms/cra-early- single. h} With the estimates of 



Algorithm 6 C++ Mapper class 



1 template<class Data, class Function> 

2 Struct Mapper { 

3 Mapper( const Data &b , const Function^ h) 

: A(b), g(h) {} 

5 

6 template<class Field> typename F iel d : : Element^ 

7 operator() (typename Field :: Element^ d, 

8 const Field& F) const { 

9 typename Data :: template rebind <Field >:: other Ap ; 

10 Homomorphism : : map(Ap , this— >A, F); 

11 return this >g( d, Ap ); 

12 } 

13 

14 protected: const Data& A; const Function^ g; 

15 }; 



Algorithm 7 C++ Chinese remaindering scheme 



1 // [..-] builder, matrix initializations etc. 

2 Chine5eRemainder<ModularEarlySingleCRA> ETcra( Builder); 

3 Mapper<Spa rseMatrix<lnteger >,Determinant> BBox(A, Det ) ; 

4 Integer d ; 



ETcra( d , BBox) ; // Call to the controller 



table [U the cost of the whole reconstruction of algorithm [4] thus becomes 

t 
^ (apply + 8i + 0(1)) = 

{t + £:r)apply + 4(i + ETf + 0{t) (2) 

where t — [logj^ {R)~\ and /? is the word size. 

This strategy enables the least possible number of calls to BlackBox. apply. 
It it thus useful when the latter dominates the cost of the reconstruction. 

4.3 Balanced termination 

Another classic case is when one wants to use fast integer arithmetic for the 
reconstruction. Then the balanced computations are mandatory and the radix 
ladder becomes handy. The problem now becomes the early termination. There 
a simple strategy could be to test for termination only when the number of 
computed residues is a power of two. In that case the reconstruction is guaran- 
teed to be balanced and fast Chinese remaindering is also guaranteed. Moreover 
random moduli are not any more necessary for all the residues, only those test- 
ing for early termination need be randomly generated. This induces another 



Algorithm 8 EARLYSiNGLECRA.update(u,p) 



Global: U = R mod M. 

Global: A variable Stabilization initially set to 0. 

Input: V = R mod p. 

Output: Rmn = R mod M x p. 

I: u = U mod p; 

2: if u ~= V then 

3: Increment Stabilization; 

4: Return (C/,M xp); 

5: else 

6: Stabilization = 0; 

7: Return Reconstruct(C/ modAf, u mod p); 

8: end if 

Algorithm 9 EARLYSiNGLECRA.notTerminated() 
1: Return Stabilization < EarlyTerminationThreshold; 



saving if one fixes the other primes and precomputes all the factors A/,; x (Afj 
mod A/i+i). There the cost of the reconstruction drops by a factor of 2 from 
2{ma + da)l°' to (m„ + d„)P. 

The drawback is an extension of the number of black box applications from 
[log2fi (^)l + ET to the largest power of two immediately superior and thus up 
to a factor of 2 in the number of black box applies. For the Builder, the update 
becomes just a push in the ladder as shown on algorithm 1101 

Algorithm 10 EARLYBALANCEDCRA.update(w,p) 
1: RADixLADDER.insert(w,p); 

The termination condition, on the contrary tests only when the number of 
residues is power of two as shown on algorithm 1111 

Then, the whole reconstruction of algorithm |4] now requires: 



ET ■ (apply + 3 • 2'=) + ^ -^ (apply + (m„ + d„)2-) 



2' 

i=0 



if + k + ET-l)- apply + (2'^)" ^^^±^ + 0(2^) 



-(apply + 3 -2*) 

2" 



(3) 



operations, where now k = [Iog2(log2^(i?))]. 

Despite the augmentation in the number of black box applications, the latter 
can be useful, in particular when multiple values are to be reconstructed. 

Example 1. Consider the Gauflian elimination of an integer matrix where all 
the matrix entries are larger than n and bounded in absolute value by Aoo- Let 
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Algorithm 11 EARLYBALANCEDCRA.notTerminated() 



1 

2 

3 

4 

5 

6 

7; 

8 

9 

10 

11 

12 



if Only one Shelf, Shelf [i], is full then 
Set Ui to Shclf[i] residue; 

for J = 1 to EarlyTerminationThreshold do 
p :=:PrimeGenerator(); 
if {Ui mod p) ! = B lack Box. apply (p) then 

Return false; 
end if 
end for 
Return true; 
else 

Return false; 
end if 



fltx) = log23(^oo) a'^rf suppose one would like to compute the rational coefficients 
of the triangular decomposition only by Chinese remaindering (there exist better 
output dependant algorithms, see e.g. \2S^ . but usually with the same worst-case 
complexity). Now, Hadamard bound gives that the resulting numerators and 
denominators of the coefficients are bounded by y^nA^ . Then the complexity of 
the earliest strategy would be dominated by the reconstruction where the balanced 
strategy or the hybrid strategy of figure\^ could benefit from fast algorithms: 



EarlySingleCRA 
|:arlyMultipCRA 
"EarlyBalancedCRA' 




Table 2: Early termination strategies complexities for Chinese remaindered 
Gaufiian elimination with rationals 

In the case of small matrices with large entries the reconstruction dominates 
and then a balanced strategy is preferable. Now if both complexities are com- 
parable it might be useful to reduce the factor of 2 overhead in the black box 
applications. This can be done via amortized techniques, as shown next. 

4:.4: Amortized termination 

A possibility is to use the p-amortized control of [5]: instead of testing for 
termination at steps 2^, 2^, . . ., 2% ... the tests are performed at steps p^^^\ 
p^^'^\ . . ., p^^^\ . ■ . with 1 < p <2 and g satisfies Vi, g{i) < i. If the complexity 
of the modular problem is C and the number of iterations to get the output is 
b, [2 give choices for p and g which enable to get the result with only b + ^^ 
iterations and extra 0{f{b)) termination tests where f{b) ~ logpifj). 

In example [T] the complexity of the modular problem is n" , the size of the 
output and the number of iterations is na^o so that strategy would reduce 
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the iteration complexity from 271""*" ^Ooo to (nooo + o{naoo))n'^ and the overaU 
complexity would then become: 



|EarlyAinortizedCRA| 


+ log(naoo)n"a^) 



Indeed, we suppose that the amortized technique is used only on a linear 
combination, and that the whole matrix is reconstructed with a |FullMultipCRA[ 
as in figure [5] Then the linear combination has size 2 log(n) + n ■ a^o which is 
still 0{n ■ aoo)- Nonetheless, there is an overhead of a factor log(naoo) in the 
linear combination reconstruction since there might be up to O(log(naoo)) values 
psM ^ p9(i+i)^ ^ _ between any two powers of two. Overall this gives the above 
estimate. Now one could use other g functions as long as eq. |4]is satisfied. 

r(p9('+l) _p9W) =o(p9W) 
|(p9(»+feW) _p9W) ^2ri°g2(p'"")l, fc(j) =o(pS«) ^ ' 

5 Parallelization 

All parallel versions of these sequential algorithms have to consider the parallel 
merge of radix ladders and the parallelization of the loop of the CRA-control 
algorithm H) Many parallel libraries can be used, namely OpenMP or Cilk 
would be good candidates for the parallelization of the embarrassingly parallel 
|FullMultipCRA| Now in the early termination setting, the main difficulty comes 
from the distribution of the termination test. Indeed, the latter depends on data 
computed during the iterations. To handle this issue we propose an adaptive 
parallel algorithm O [24] and use the Kaapi library [6l [16] . Its expressiveness 
in an adaptive setting guided our choice, together with the possibility to work 
on heterogenous networks. 

5.1 Kaapi overview 

Kaapi is a task based model for parallel computing. It was targeted for dis- 
tributed and shared memory computers. The scheduling algorithm uses work- 
stealing O [U [31 [17] : an idle processor tries to steal work to a randomly selected 
victim processor. 

The sequential execution of a Kaapi program consists in pushing and pop- 
ping tasks to dequeue the current running processor. Tasks should declare the 
way they access the memory, in order to compute, at runtime, the data flow 
dependencies and the ready tasks (when all their input values are produced). 
During a parallel execution, a ready task, in the queue but not executed, may 
be entirely theft and executed on an other processor (possibly after being com- 
municated through the network). These tasks are called dfg tasks and their 
schedule by work-stealing is described in [161 [l2] ■ 

A task being executed by a processor may be only partially theft if it interacts 
with the scheduler, in order to e.g. decide which part of the work is to be given 
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to the thieves. Such tasks are called adaptive tasks and allows fine grain loop 
parallelism. 

To program an adaptive algorithm with Kaapi, the programmer has to spec- 
ify some points in the code (using |kaapi_stealpointp or sections of the code 
( |kaapi_'s tealbegin, kaapi_stealend') where thieves may steal work. To guar- 
antee that parallel computation is completed, the programmer has to wait for 
the finalization of the parallel execution (using k aapi_steal_f inalize] ). More- 
over, in order to better balance the work load, the programmer may also decide 
to preempt the thieves (send an event via |kaapi_preenipt_nextl ). 

5.2 Parallel earliest termination 

Algorithm 1121 lets thieves steal any sequence of primes. 

Algorithm 12 ParallelCRA-Control 
1 

2 

3 

4 

5 

6 

7 

8 

9 
10 
11 
12 
13 
14 
15 
16 



Builde7\initialize{); 
while BuiWer.notTerminatedO do 
p := _BMiWer.nextCoPrime(); 
kaapi_stealbegin( splitter, Builder); 
V :— BlackBox.apply{p); 
Builder .update{v , p); 
kaapi_finalize_steal() ; 
kaapi_stealend ( ) ; 
if require synchronization step then 
while kaapi_nomore_thief do 

{list of v,list of p) :=kaapi_preempt_next(); 
Builder .wpAsiteilist of v,list of p); 
end while 
end if 
end while 
Return i3Mi/der.reconstruct(); 



At lineHl the code allows the scheduler to trigger the processing of steal requests 
by calling the splitter function. The parameters of |kaapi_stealbegiri] are the 
splitter function and some arguments to be given to its call. These argument^ 
can e.g. specify the state of the computation to modify (here the builder object 
plays this role). Then, on the one hand, concurrent modifications of the state of 
computation by thieves, must be taken care of during the control flow between 
lines m and m here the computation of the residue could be evaluated by multiple 
threads without critical sectior|3. On the other hand, after line [51 the scheduler 
guarantees that no concurrent thief can modify the computational state when 
they steal sonic work. Remark that both branches of the conditional ifj at line[9] 



*in or out 

^This depends on the implementation, most of the LinBox library functions are reentrant 
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must be executed without concurrency: the iteration of the hst of thieves or the 
generation of the next random modulus are not reentrant. 

The role of the splitter function is to distribute the work among the thieves. 
In algorithm ll31 each thief receives a lcoPrimeGeneratorl obiect and the entrypoint 
to execute. 

Algorithm 13 SPLnTER{Builder, N,requests[]) 
1: for i = to iV - 1 do 

2: kaapi_request_reply(re(7west[i], entrypoint, 

Builder. getCoPrimeGeneratorQ ); 
3: end for 

The IcoPrimeGeneratorl depends on the Builder] type and allows the thief 
to generate a sequence of moduli. For instance the coPrimeGenerator for the 
earliest termination contains at one point a single modulus AI which is returned 
by the next call of InextCoPrime Q] by the lBuilderl 

The splitter function knows the number N of thieves that are trying to steal 
work to the same victim. Therefore it allows for a better balance of the work 
load. This feature is unique to Kaapi when compared to other tools having a 
work-stealing scheduler. 

5.3 Synchronization 

Now, the victim periodically tests the global termination of the computation 
(line[9lof algorithm [T2)) . Depending on the chosen termination method ( |Early*CRA[ 
etc.), the synchronization may occur at every iteration or after a certain num- 
ber of iterations. The choice is made in order to e.g. amortize the cost of this 
synchronization or reduce the arithmetic cost of the reconstruction. Then each 
thief is preempted (line llip and the code recovers its results before giving them 
to the Builder for future reconstruction (line [T^ . 

The preemption operation is a two way communication between a victim and 
a thief: the victim may pass parameters and get data from one thief. Note that 
the preemption operation assumes cooperation with the thief code. The latter 
being responsible for polling incoming events at specific points (e.g. where the 
computational state is safe preemption- wise) . 

On the one hand, to amortize the cost of this synchronization, more primes 
should be given to the thieves. In the same way, the victim code works on a 
list of moduli inside the critical section (at line [3] returns a list of moduli, and 
at lines [5]|6] the victim iterates over this list by repeatedly calling apply and 
|update| methods) . On the other hand, to avoid long waits of the victim during 
preemption, each thief should test if it has been preempted to return quickly its 
results (see next section). 
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5.4 Thief entrypoint 

Finally, algorithm [T3] returns both the sequence of residues and the sequence 
of primes that where given to the BlackBox. This algorithm is very similar to 
algorithm \T% 

Algorithm 14 Thief's EntryPoint(M) 

1: Bui/der. initialize(); 

2: list of w.clearO; 

3: list of p.clear(); 

4: while _BiM?(ier.CoPrimeGenerator() not empty do 

5: if kaapi_preemptpoint() then break; end if 

6: p :~ Builder.nextCoPrimeQ; 

7: kaapi_stealbegin( splitter, Builder); 

8: list of p.push_back(p); 

9: list of ?j.push_back(B/acfc_Boa;.apply(p)); 
10: kaapi_stealend(); 
11: end while 
12: kaapi_stealreturn {list of v,list of p); 

Lines [7] and [10] define a section of code that could be concurrent with steal 
requests. At line \E\ the code tests if a preemption request has been posted by 
algorithm [12] at line [11] If this is the case, then the thief aborts any further 
computation and the result is only a partial set of the initial work allocated by 
the splitter function. 

5.5 Efficiency 

These parallel versions of the Chinese remaindering have been implemented 
using Kaapi transparently from the LinBox library: one has just to change 
the sequential controller |cra-domain . h| to the parallel one. 

In LinBox-1.1.7 some of the sequential algorithms which make use of some 
Chinese remaindering are the determinant, the minimal/characteristic polyno- 
mial and the valence, see e.g. [20] [12] [H] [8] for more details. 

We have performed these preliminary experiments on an 8 dual core machine 
(Opteron 875, 1MB L2 cache, 2.2Ghz, with SOGBytes of main memory). Each 
processor is attached to a memory bank and communicates to its neighbors via 
an hypertransport network. We used g-|— I- 4.3.4 as C-\ — h compiler and the Linux 
kernel was the 2.6.32 Debian distribution. 

All timings are in seconds. In the following, we denote by T^eq the time of 
the sequential execution and by Tp the time of the parallel execution for p = 8 
or p = 16 cores. All the matrices are from "Sparse Integer Matrix Collection" 
(SIMCB 



^http : //I j k . Imag . f r/CASYS/SIMC] 
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Table[3]gives the performance of the parallel computation of the determinant 
for small invertible matrices (less than a second) and larger ones (an hour CPU) 
of the |SIMC/SPG| and |SIMC/Tref ethen| collections. 



Matrix 


d,r 


J- seq K 


Tp=8k 


Tp=i6 k 


ex — 1 
ex -3 


560, 8736 
2600, 71760 


0.29 [4] 
837.80[184] 


0.16[9] 
123.56[193] 


0.22(16.8] 
77.99(193] 


t-150 
i-300 
i-500 
i-700 
t - 2000 


150, 2040 
300, 4678 
500, 8478 
700, 12654 
2000, 41907 


0.21[59] 

2.52[138] 

15.19[249] 

52.59[367] 

2978.23[1274] 


0.046(63.4] 

0.36(144.8] 

2.05(257] 

6.50(368.9] 

384.43(1281] 


0.036(63.6] 
0.24(144.7] 
1.31(256.3] 
4.19(371.2] 
236.59(1281] 



Table 3: Timings for the computation of the determinant, d is the dimension 
of the matrix, r the number of non-zero coefficients, [k] is the mean number of 
primes observed for the Chinese remaindering using p cores. 

The small instance (ex-1) needed very few primes to reconstruct integer the 
solution. There, we can see the overhead of parallelism: this is due to some 
extra synchronizations and also to the large number of unnecessary modular 
computations before realizing that early termination was needed. Despite this 
we do achieve some speed-up. 

We show on table |4] the corresponding speed-ups of table [3] compared with 
a naive approach using OpenMP: for p the number available cores, launch the 
computations by blocks of p iterations and test for terminaison after each block 
is completed. For large computations the speed-up is quite the same since the 



Matrix 


ex — 1 


ex -3 


t-150 


i-300 


i-500 


i-700 


t - 2000 


Naive 
AlgM 


1.38 

1.35 


10.66 
10.74 


2.10 
5.78 


8.52 
10.52 


11.29 
11.56 


12.55 
12.55 


12.48 
12.59 



Table 4: Speed-up using 16 cores of algorithm [12] compared to a naive approach 
with OpenMP 

computation is largely dominant. For smaller instances we see the advantage of 
reducing the number of synchronizations. On e.g. multi-user environments the 
advantage should be even greater. 

6 Conclusion 

Wc have proposed a new data structure, the radix ladder, capable of managing 
several kinds of Chinese reconstructions while still enabling fast reconstruction. 
Then, we have defined a new generic design for Chinese remaindering schemes. 
It is summarized on figure [31 Its main feature is the definition of a builder 
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interface in charge of the reconstruction. This interface is such that any of ter- 
mination (deterministic, early terminated, distributed, etc.) can be handled by 
a CRA controller. It enables to define and test remaindering strategies while 
being transparent to the higher level routines. Indeed we show that the Chinese 
remaindering can just be a plug-in in any integer computation. 



BlackBoxes 
Determinant 
Minpoly ^ 

Valence Je^ 



Mapper 



Binder 
Adaptors 



Controllers 



CRA-Control 
Parallel-CRA 

SpliUer 
ThielEnliyPoint 



Builders 



Eariy " CRA 
FuilMultipCRA 



->■ RaddlxLadder 



Figure 3: Generic Chinese remaindering scheme 

We also provide in LinBox-1.1.7 an implementation of the ladder, several 
implementations for different builders and a sequential controller. Then we 
tested the introduction of a parallel controller, written with Kaapi, without 
any modification of the LinBox library. The latter handles the difficult issue of 
distributed early termination and shows good performance on a SMP machine. 

In parallel, some improvement could be made to the early termination strat- 
egy in particular when the BlackBox is fast compared to the reconstruction and 
when balanced and amortized techniques are required. Also, output sensitive 
early termination is very useful for rational reconstruction, see e.g. [21] and 
thus the latter should benefit from this kind of design. 
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